Main analysis (Exploratory Data Analysis)
1. Analysis on Age distribution of our customers
df_2month = full_join(summary_18_6, summary_17_6, by = "factor_year")
df_3month = full_join(df_2month, summary_16_6, by = "factor_year")
t1 = data.frame(factor_year = df_3month$factor_year,Freq = df_3month$Freq.x, time = df_3month$time.x)
t2 = data.frame(factor_year = df_3month$factor_year,Freq = df_3month$Freq.y, time = df_3month$time.y)
t3 = data.frame(factor_year = df_3month$factor_year,Freq = df_3month$Freq, time = df_3month$time)
tf = rbind(t1,t2,t3)
ggplot(tf, aes(x = factor_year, y = Freq, color = as.factor(time))) + geom_point() +
coord_flip() + xlab("user's birth year") +
ggtitle("Frequency of Citi Bike users' birth year in different months")+
labs(fill="Time") + theme_grey(16)
This Cleveland Dot Plot shows the distribution of user’s birth year in 2016/6, 2017/6 and 2018/6. They are all left-skewed, which is what we expect. It is glad to see number of users for all ages increases from 2016 to 2018. In 2018 June, the active users are aged between 24 and 37.
set.seed(321)
index = sample(nrow(df), round(nrow(df)*0.001))
sample = df[index,]
g = ggplot(sample, aes(birth.year,minute))
g + geom_point(alpha = 0.5, col = "#fb6a4a") + ylab("trip duration (in min)") +
xlab("users' birth year") + ggtitle("Scatterplot of uses' birth year and trip duration (in sec) in Jun 2018") + theme_grey(16)
There are 1.945,611 observations in our data set, so we sampled 1% of them to see the relationship between user’s birth of year and trip durations.
We were expect to see an increasing and then decreasing trend as in the distribution of users’ birth year (right-skewed). However, there is no obvious trend in the scatterplot. Again, since NA birth year were transfered to 1969, there are a lot observations in 1969 and its covers a wide range of trip durations, which is misleading.
2. Analysis on Gender of our customers
gender = data.frame(group = c("unknown", "male", "female"),
value = summary(as.factor(df$gender)))
pie = ggplot(gender, aes(x="", y = value, fill=group)) +
geom_bar(width = 1, stat="identity") +
coord_polar("y", start = 0)
blank_theme <- theme_minimal()+
theme(
axis.title.x = element_blank(),
axis.title.y = element_blank(),
panel.border = element_blank(),
panel.grid=element_blank(),
axis.ticks = element_blank(),
plot.title=element_text(size=14, face="bold")
)
pie + scale_fill_manual(values=c("#fbb4ae", "#b3cde3", "#ccebc5")) +
blank_theme + theme(axis.text.x=element_blank()) +
geom_text(aes(y = value/3 + c(0, cumsum(value)[-length(value)]),
label = c("10%","66%", "24%")), size=5) +
ggtitle("Ratio of users' gender of Citi Bike ") + theme_grey(16)
There are 66% male users, 24% female users and 10% unkown. Apprently, there is a big differnece betwee female users and male users. We have seen some news report that Citi Bike users sometimes complain about the weight of citibike. Maybe that is one of the reasons that female users are small. Thus, we can think of some strategies for female users to increase profits because there is indeed room for improvement.
3. Analysis on Station Distribution and Corresponding Number of Bikes
Below is the heatmap of number of trips for each station. We also label the top 10 popular stations
## get station info
station.info <- data_18_6 %>%
group_by(start.station.id) %>%
summarise(lat=as.numeric(start.station.latitude[1]),
long=as.numeric(start.station.longitude[1]),
name=start.station.name[1],
n.trips=n())
pal = colorNumeric(
palette="YlOrRd",
domain = station.info$n.trips
)
top10 = station.info %>% dplyr::arrange(desc(n.trips))
icon <- makeIcon(
iconUrl = ".img/citi_bike_icon.png",
iconWidth = 20, iconHeight = 20,
iconAnchorX = 22, iconAnchorY = 22,
shadowWidth = 20, shadowHeight = 20,
shadowAnchorX = 4, shadowAnchorY = 62
)
leaflet(station.info) %>%setView(-74.00, 40.71, zoom = 12) %>%
leaflet::addProviderTiles("CartoDB.Positron") %>%
leaflet::addCircles(lng = ~long, lat = ~lat, color =~pal(n.trips)) %>%
leaflet::addLegend("bottomleft", pal = pal, value = ~n.trips, title = "Num of trips starting from this station") %>%
addMarkers(lng = ~top10[1:10,]$long, lat = ~top10[1:10,]$lat, icon = icon)
We can see that there are more users starting their trips in Manhattan. In Manhattan, Midtown is the most popular area. The station near Central Park is also a popular spot, probably because there are some tourists riding inside Central Park. Back to our first question in the introduction, we are wondering why we do not see a lot of people around campus riding with Citi Bike. The number of stations around campus is not limited, but the starting rides from those stops are small. One possible explanation we believe is that there are quite amount of students living in dorms or very close to campus. So they do not need bicycles to transport.
4. Analysis on weather and season
Join weather data to see if weather has an impact
Intuitively, It is reasonable to assume that weather and season has a huge impact on CitiBike frequency.
NYCweather<-read.csv("./data/NYCweather.csv")
df$Date <- as.Date(df$starttime)
NYCweather$Date<-as.Date(NYCweather$Date)
df_weather <- inner_join(df,NYCweather,by="Date")
df_weather$Severe <- as.factor(df_weather$Severe)
ggplot(df_weather, aes(x=Date))+
geom_bar(position='dodge', colour="white", fill="#fb6a4a")+
ggtitle("Number of trips by day") + theme_grey(16)
We made a join of Citi Bike trip data and NYC Weather data of June 2018 to see if weather has an impact on trips. In the histogram below, we showed the number of trips corresponding to each day in June 2018. As we can observe, June-26 has abnormally more trips than other days. One explanation is that maybe missing data is automatically imputed as June01, or maybe there is some special events or promotions on that day, which caused an increase in the number of trips.
In the NYC Weather dataset, a severe weather is defined as either rainy, foggy or snowy. We compared the number of trips corresponding to severe and unsevere weathers, and we can see a difference here. Notice that the number of days with severe weather and days with unsevere weather are both 15, so this comparison can reflect the difference between these two levels.
ggplot(df_weather, aes(x=Severe))+
geom_bar(colour="white", fill="#fb6a4a")+
ggtitle("Number of trips per day for severe and unsevere weather condition") +theme_grey(16)
We then visualized the average number of trips per day in different temperatures, windspeeds, and precipitation. These visualizations are limited by the precision of NYC Weather dataset – only average windspeed, temperature and precipitation each day is available, and since only 30 days of data is included, we fail to observe a significant pattern here. However, there are peaks for each metric - the number of trips peaks at tempeature around 72, windspeed around 3.8 m/s, and precipitation of 0.
df_by_temp <- df_weather %>% group_by(Temperature) %>% summarise(n_distinct(starttime))
ggplot(data = df_by_temp, aes(x=as.numeric(Temperature), y=`n_distinct(starttime)`))+
stat_summary(geom="bar", fill="#fb6a4a")+
xlab("Temperature")+
ylab("Average trips per day")+
ggtitle("Average daily trips in different temperatures") +theme_grey(16)
df_by_windspeed <- df_weather %>% group_by(Windspeed) %>% summarise(n_distinct(starttime))
ggplot(data = df_by_windspeed, aes(x=as.numeric(Windspeed), y=`n_distinct(starttime)`))+
stat_summary(geom="bar", fill="#fb6a4a")+
xlab("Windspeed")+
ylab("Average trips per day")+
ggtitle("Average daily trips in different windspeeds")+theme_grey(16)
df_by_precip <- df_weather %>% group_by(Precipitation) %>% summarise(n_distinct(starttime))
ggplot(data = df_by_precip, aes(x=as.numeric(Precipitation), y=`n_distinct(starttime)`))+
stat_summary(geom="bar", fill="#fb6a4a")+
xlab("Precipitation")+
ylab("Average trips per day")+
ggtitle("Average daily trips in different precipitations")+theme_grey(16)
We also compare the average trip duration in severe and unsevere weather, and we can see the difference is minor compared to the difference in the number of trips.
ggplot(data = df_weather,
aes(x=factor(Severe), y=minute,fill= '#fb6a4a'))+
xlab("Severity of weather")+ylab("Average trip duration")+
stat_summary(fun.y="mean", geom="bar")+
ggtitle("Average trip duration for severe and unsevere weather condition") + theme_grey(16)+ theme(legend.position="none")
The next 3 graphs show the relationship between the average trip durations in different windspeed, temperature and precipitation. Due to the scarcity of weather data, we fail to observe a significant pattern in temperature and windspeed, but we can observe the average trip duration drops with precipitation.
ggplot(data = df_weather,
aes(x = Windspeed, y = minute,color='#fb6a4a')) +
scale_x_continuous("Windspeed") +
scale_y_continuous("Average trip duration in minutes") +
ggtitle("Trip duration vs. Windspeed") +
stat_summary(fun.y="mean", geom = "line", size=1)+theme_grey(16)+ theme(legend.position="none")
ggplot(data = df_weather,
aes(x = Temperature, y = minute,color='#fb6a4a')) +
scale_x_continuous("Temperature") +
scale_y_continuous("Average trip duration in minutes") +
ggtitle("Trip duration vs. Temperature") +
stat_summary(fun.y="mean", geom="line", size=1)+ theme_grey(16)+ theme(legend.position="none")
ggplot(data = df_weather,
aes(x = Precipitation, y = minute,color='#fb6a4a')) +
scale_x_continuous("Precipitation") +
scale_y_continuous("Average trip duration in minutes") +
ggtitle("Trip duration vs. Precipitation") +
stat_summary(fun.y="mean", geom = "line", size=1)+ theme_grey(16)+ theme(legend.position="none")
Interactive time series to see if season has an impact on customer flow with Dygraph
We combine over a year time series to see if season has an impact on customer flow with Dygraph. The data is from 201706 to 201806.
library(dygraphs)
library(xts)
library(lubridate)
library(zoo)
df_timeseries <- read.csv('./data/data_2017_2018.csv')
a <- lubridate::mdy_hms(df_timeseries$starttime)
b <- as.Date(df_timeseries$starttime)
b[is.na(b)] <- a[!is.na(a)]
df_timeseries$starttime <- b
df_timeseries <- df_timeseries %>% mutate(time = format(as.Date(starttime), "%Y-%m"))
df_timeseries <- df_timeseries[,c("time","usertype")]
df_subscriber <- df_timeseries %>% subset(usertype == 'Subscriber')
df_customer <- df_timeseries %>% subset(usertype == 'Customer')
df_subscriber <- df_subscriber %>% group_by(time) %>% summarise(Number = n())
df_customer <- df_customer %>% group_by(time) %>% summarise(Number = n())
ts_subscriber <- xts::xts(df_subscriber$Number,as.Date(as.yearmon(df_subscriber$time)))
ts_customer <- xts::xts(df_customer$Number,as.Date(as.yearmon(df_customer$time)))
ts <- cbind(ts_subscriber,ts_customer)
dygraph(ts,main = 'Impact of season',ylab = "Frequency") %>% dySeries('..1',label = 'subscriber ') %>% dySeries('..2',label = 'customer') %>% dyLegend(show = "always", hideOnMouseOut = FALSE,width=400) %>%
dyOptions(colors = RColorBrewer::brewer.pal(3, "Set2"))
From the graph above, we can observe a huge drop during cold weather like Januray. We can infer that season factor indeed has a huge impact on frequency.
5. Analysis about customer flow
library(lubridate)
df_timeslot <- mutate(df,hour = hour(as.POSIXct(starttime)))
df_timeslot <- mutate(df_timeslot,wday = wday(as.POSIXct(starttime)))
df_timeslot <- df_timeslot[,c("hour","wday","minute","start.station.name","usertype")]
timeslot <- ggplot(data=df_timeslot,mapping=aes(as.factor(hour))) +geom_bar(fill='#fb6a4a')+ggtitle('Time slot in trip data group by usertype')
timeslot + facet_wrap(~usertype) + xlab('Hours')+ylab('Frequency')+theme_grey(16)
From the above analysis, the overall time slot has two peaks: in 8-9 am and 5-6 pm, which is reasonable since they are the rush hour, which has a large number of passenger flow.
However, the time slot distribution differs a lot with aspect to different usertype and different weekdays.Take the above images as examples, the customer has a higher peak in the time slot which is around 4 pm while the subscribers has two peaks.
timeslot <- ggplot(data=df_timeslot,mapping=aes(as.factor(hour))) +geom_bar(fill='#fb6a4a')+ggtitle('Time slot in trip data group by weekdays')
timeslot + facet_wrap(~as.factor(wday)) +xlab('Hours')+ylab('Frequency')+ theme_grey(16)
Also, the weekdays play a big role in the time duration distribution. On weekdays, we still sees similar two peaks. However, on weekends, we can only observe one peak. Thus, we can think some specific pricing strategy for a specific time slot in the weekdays.